========================================================

Installation of the required packages and loading of their libraries

Intro to the dataset

The dataset to be analyzed corresponds to white wine samples of different variants of the Portuguese “Vinho Verde”. Inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):
  1 - fixed acidity (tartaric acid - g / dm^3)
  2 - volatile acidity (acetic acid - g / dm^3)
  3 - citric acid (g / dm^3)
  4 - residual sugar (g / dm^3)
  5 - chlorides (sodium chloride - g / dm^3)
  6 - free sulfur dioxide (mg / dm^3)
  7 - total sulfur dioxide (mg / dm^3)
  8 - density (g / cm^3)
  9 - pH
  10 - sulphates (potassium sulphate - g / dm^3)
  11 - alcohol (% by volume)

Output variable (based on sensory data):
  12 - quality (score between 0 and 10)

Missing attribute values: none

Univariate Plots Section

I will start by looking at the distributions in the white wine dataset. For this I will plot a histogram of each variable in the file.

Data set info
## [1] 4898   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The file contains data from 4898 white wines, with numerical data for 11 physicochemical parameters, one quality score, and one additional column (X) that numbers the samples. Some of the features have outliers that lie far above the 3rd quartile of their distributions (e.g. fixed acidity, volatile acidity, residual sugar, total sulfur dioxide). I will create a discrete value by transforming the quality score and I will include a new variable for rating the wines as bad (<5), average (57)

Output variable information (Quality)

We can observe that most of the white wines in the list are considered “average quality”. Next I will explore the data by creating histograms for each of the 11 continuous variables. To see them better I will group them together.

As the distributions are skewed, because most of the variables have outliers, I will create up to three histograms per variable, as follows:

  • data as is (the purple histogram), with a red line at the 95% quantile threshold
  • data without the upper outliers (the blue histogram)
  • data without outliers, depicted on a log10 scale when needed

Outliers are identified using the interquartile range method. Associated descriptive statistics are provided when relevant.
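The trimming step described above can be sketched in base R as follows. This is only an illustration on synthetic data, not the code used for the report: the vector `x` is made up, and the 1.5 multiplier on the interquartile range is the usual Tukey convention, assumed here because the text does not state the factor.

```r
# Synthetic right-skewed sample standing in for one wine variable (assumption).
set.seed(1)
x <- c(rnorm(500, mean = 6.8, sd = 0.8), 12, 13, 14.2)

# Interquartile range rule: values above Q3 + 1.5 * IQR count as upper outliers.
# The factor 1.5 is an assumed convention, not taken from the report.
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
x_trimmed <- x[x <= upper_fence]

# 95% quantile, used for the red threshold line on the "as is" histogram.
p95 <- quantile(x, probs = 0.95, names = FALSE)

# Histograms are computed without drawing so the sketch also runs headless.
h_raw <- hist(x, breaks = 30, plot = FALSE)
h_trimmed <- hist(x_trimmed, breaks = 30, plot = FALSE)

c(n_raw = length(x), n_trimmed = length(x_trimmed))
```

Calling `plot(h_raw)` and `abline(v = p95, col = "red")` would then draw the first style of histogram with its threshold line.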

Fixed acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The histogram looks approximately normal; the mean and the quartiles are shown above. The maximum (14.2) lies far above the 3rd quartile (7.3), which points to upper outliers.

Volatile acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The histogram looks approximately normal; the mean and the quartiles are shown above. The maximum (1.10) lies far above the 3rd quartile (0.32), which points to upper outliers.

Citric acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The histogram looks approximately normal; the mean and the quartiles are shown above.

Residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The histogram had to be plotted on the log10 scale; the mean and the quartiles are shown above. The distribution of residual sugar looks bimodal, as if two distributions were mixed.
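As a rough illustration of why the log10 scale helps here, the sketch below builds a synthetic mixture of two groups and computes its histogram on both scales. The data and the group parameters are invented for the example and are not the wine measurements.

```r
set.seed(2)
# Synthetic mixture of a "dry" and a "sweet" group (assumption), in g / dm^3.
sugar <- c(rlnorm(600, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(400, meanlog = log(10),  sdlog = 0.4))

# On the raw scale the sample is right-skewed: the mean sits above the median.
summary(sugar)

# On the log10 scale the two groups show up as two separate peaks.
h_raw <- hist(sugar, breaks = 40, plot = FALSE)
h_log <- hist(log10(sugar), breaks = 40, plot = FALSE)
```

Drawing `plot(h_log)` next to `plot(h_raw)` makes the two peaks visible that the raw scale compresses into a single long tail.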

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The histogram looks approximately normal; the mean and the quartiles are shown above. The maximum (0.346) lies far above the 3rd quartile (0.050), which points to upper outliers.

Free sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The histogram looks approximately normal; the mean and the quartiles are shown above. The maximum (289) lies far above the 3rd quartile (46), which points to upper outliers.

Total sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The histogram looks approximately normal; the mean and the quartiles are shown above. The maximum (440) lies far above the 3rd quartile (167), which points to upper outliers.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The histogram looks approximately normal; the mean and the quartiles are shown above.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The histogram looks approximately normal; the mean and the quartiles are shown above.

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The histogram looks approximately normal; the mean and the quartiles are shown above.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

All three plot styles have been used for the alcohol variable, including the log10 scale; the mean and the quartiles are shown above.

Univariate Analysis

What is the structure of your dataset?

The dataset is composed of 4898 records of white wine. For each we have data on 12 different characteristics or features, of which one (quality) is a discrete categorical variable. From this variable I have created a new one, classifying the wines into 3 categories according to their rating. The remaining variables are physical and chemical properties, e.g. % of alcohol, pH, acidity, density, etc.

What is/are the main feature(s) of interest in your dataset?

Quality is my main feature of interest and the one by which the consumer judges a wine; on the other hand, the perception of the quality of a wine is closely linked to its properties. As taste is one of the factors to take into account, I will also look at residual sugar and alcohol percentage.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I will investigate the relationship between quality and the main physical/chemical characteristics (acidity, sugar content, pH, alcohol). Density could also be related to the alcohol content.

Did you create any new variables from existing variables in the dataset?

I have created one new categorical variable, called rate, that classifies the wines into categories according to the quality value of each record.
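A derived variable of this kind can be sketched with `cut()` in base R. The scores below are hypothetical, and the cut-offs are an assumption: the report only states that bad means a quality below 5, so the boundary between average and excellent used here (7 and above) is a guess for illustration.

```r
# Hypothetical quality scores; the real ones come from the wine data frame.
quality <- c(3, 4, 5, 6, 6, 7, 8, 9)

# Assumed cut-offs: bad (< 5), average (5 or 6), excellent (>= 7).
# Only the "bad" threshold is stated in the report; the rest is an assumption.
rate <- cut(quality,
            breaks = c(-Inf, 4, 6, Inf),
            labels = c("bad", "average", "excellent"),
            ordered_result = TRUE)

table(rate)
```

Because `cut()` uses right-closed intervals by default, integer scores of 4 fall into "bad" and scores of 6 into "average".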

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

If we look at the shapes of the histograms, all have similar distributions except residual sugar and alcohol. The scale of the x axis is wider than usual because of points outside the boxplot whiskers (outliers).

Bivariate Plots Section

First we install and load the libraries needed to analyze the relations between variables.

I will create a panel to analyze the relationships between the different variables.

Pairs matrix plot

When plotting the panels in the console and examining them at a higher resolution in the Plots pane, we can observe that the algorithm works very well: it shows a high correlation (0.85) between quality and rate, which are essentially the same factor (remember that rate was created from quality). I have to try another plotting function, as the correlations do not stand out visually. Therefore I will use corrplot to show the correlations in red and blue and with a larger font.

Correlation matrix plot

Now I can easily identify the strongest correlations (absolute value greater than 0.45). The positive ones are:

  • residual.sugar and density
  • free.sulfur.dioxide and density
  • total.sulfur.dioxide and density

The negative ones are:

  • density and alcohol
  • total.sulfur.dioxide and alcohol
  • residual.sugar and alcohol

Therefore, from now on my parameters of interest are:

  • residual.sugar
  • alcohol
  • density
  • free.sulfur.dioxide
  • total.sulfur.dioxide
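Reading the strongest pairs off a plot can be cross-checked numerically. The sketch below, in base R on synthetic data, filters a Pearson correlation matrix on |r| > 0.45. The data frame `wine`, its four columns and the coefficients used to generate them are assumptions made up for the example.

```r
set.seed(3)
n <- 300
# Synthetic columns (assumption): density rises with sugar, falls with alcohol.
sugar   <- runif(n, 1, 20)
alcohol <- runif(n, 8, 14)
density <- 0.995 + 0.0004 * sugar - 0.001 * alcohol + rnorm(n, sd = 0.0005)
ph      <- rnorm(n, mean = 3.2, sd = 0.15)
wine    <- data.frame(sugar, alcohol, density, ph)

cm <- cor(wine)  # Pearson correlation matrix

# Keep each pair once (upper triangle) and filter on |r| > 0.45.
idx <- which(upper.tri(cm) & abs(cm) > 0.45, arr.ind = TRUE)
strong <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                     var2 = colnames(cm)[idx[, "col"]],
                     r    = cm[idx])

# Strongest pairs first, regardless of sign.
strong[order(-abs(strong$r)), ]
```

The same filter applied to the real correlation matrix would list exactly the pairs that stand out as the largest circles in the corrplot.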

Boxplots variables

The boxplots show the relationship between quality and each variable.

Boxplots without outliers

The boxplots show the same relationships once the outliers are removed.

Boxplots variables with stats

We can observe that the lower the residual sugar, the higher the evaluated quality, and the higher the concentration of alcohol, the greater the observed quality. For sulfur dioxide the level stays in the middle, as the correlation is not as strong as for the other factors. The lower the density, the higher the evaluated quality.

Bivariate plots 1
## # A tibble: 6 × 4
##   quality alcohol_mean alcohol_median     n
##     <ord>        <dbl>          <dbl> <int>
## 1       3     10.34500          10.45    20
## 2       4     10.15245          10.10   163
## 3       5      9.80884           9.50  1457
## 4       6     10.57537          10.50  2198
## 5       7     11.36794          11.40   880
## 6       8     11.63600          12.00   175

A new data frame has been created with the mean and median alcohol content per quality level.
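A summary table of this shape can be produced in several ways. The tibble output suggests dplyr was used in the report; the sketch below is a base R alternative on synthetic data, with the data frame `wine` and the drift of alcohol with quality invented for the example.

```r
set.seed(4)
# Synthetic stand-in (assumption): alcohol drifts upward with quality.
quality <- sample(3:8, 200, replace = TRUE)
alcohol <- 9 + 0.4 * (quality - 3) + rnorm(200, sd = 0.8)
wine <- data.frame(quality, alcohol)

# One row per quality level with the mean, median and count of alcohol.
by_quality <- aggregate(alcohol ~ quality, data = wine,
                        FUN = function(v) c(mean = mean(v),
                                            median = median(v),
                                            n = length(v)))

# aggregate() returns a matrix column; flatten it into ordinary columns.
by_quality <- do.call(data.frame, by_quality)
names(by_quality) <- c("quality", "alcohol_mean", "alcohol_median", "n")
by_quality
```

Keeping the count `n` next to the mean matters here, because levels with very few wines give unstable averages.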

Bivariate plots and stats 3

## # A tibble: 6 × 4
##   quality alcohol_mean alcohol_median     n
##     <ord>        <dbl>          <dbl> <int>
## 1       3     10.34500          10.45    20
## 2       4     10.15245          10.10   163
## 3       5      9.80884           9.50  1457
## 4       6     10.57537          10.50  2198
## 5       7     11.36794          11.40   880
## 6       8     11.63600          12.00   175

The boxplot shows how alcohol varies with quality.

Correlation alcohol vs. density

## geom_smooth: na.rm = FALSE
## stat_smooth: na.rm = FALSE, method = lm, formula = y ~ x, se = TRUE
## position_identity
## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

The output above gives all the metrics for the strongest negative correlation (alcohol vs. density, r = -0.78), including the 95% confidence interval.
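For readers who want to reproduce this kind of output, `cor.test()` from base R returns the estimate, the confidence interval and the p-value in one object. The sketch below runs it on synthetic vectors; the names `alcohol` and `density` and the generating formula are assumptions for illustration only.

```r
set.seed(5)
# Synthetic vectors (assumption): density decreases as alcohol increases.
alcohol <- runif(500, 8, 14)
density <- 1.005 - 0.001 * alcohol + rnorm(500, sd = 0.001)

ct <- cor.test(alcohol, density, method = "pearson")

ct$estimate   # sample correlation r
ct$conf.int   # 95% confidence interval for r
ct$parameter  # degrees of freedom, n - 2
ct$p.value
```

The degrees of freedom equal the number of observations minus 2, which is why the report shows df = 4896 for 4898 wines.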

Correlation Density vs Residual Sugar

## geom_smooth: na.rm = FALSE
## stat_smooth: na.rm = FALSE, method = lm, formula = y ~ x, se = TRUE
## position_identity
## 
##  Pearson's product-moment correlation
## 
## data:  density and residual.sugar
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

The other strong correlation is between residual sugar and density (r = 0.84).

Scatter plot 3 parameters

The scatter plots confirm the hypothesis made before, but we can observe that for residual sugar and density only a small sample was tested. This can lead to errors when making assumptions and drawing conclusions.

Scatter plot 3 parameters for extreme quality values

Plot of a new data frame with just 1243 observations instead of 4898, but maintaining the same 14 variables. It is a bivariate plot with the 3 parameters that seem to have the most influence on quality: residual.sugar, alcohol, density.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

According to the correlation results and graphs given by pairs.panels and corrplot, the parameters that have the highest correlation with quality are:

  • residual.sugar
  • alcohol
  • density
  • free.sulfur.dioxide
  • total.sulfur.dioxide

It is somewhat understandable that the consumer appreciates a wine with more alcohol. The surprise for me was the density of the wine: the lower the density, the higher the quality.

Therefore it seems that a high-quality white wine is high in alcohol, not sweet and not dense. Bear in mind that, as we haven’t conducted the experiment ourselves, we can’t infer causation from correlation. I am issuing my conclusion assuming that the factors were selected in a controlled experiment.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

As observed in the correlation matrix, in general we can see that sulfur dioxide has an influence: keeping it around the median value leads to an evaluation with a higher quality grade.

What was the strongest relationship you found?

The strongest relation (see pairs.panels) is the one between the variables rate and quality, because we created the first from the values of the second. Excluding this, the next highest is the positive correlation between residual sugar and density, as can be seen in the corrplot graph and as I double checked with the Pearson correlation value (0.839).

Multivariate Plots Section

Regression for Three main parameters and Quality
## 
##  Pearson's product-moment correlation
## 
## data:  residual.sugar and quality_lm
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.12524103 -0.06976101
## sample estimates:
##         cor 
## -0.09757683
## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and quality_lm
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747
## 
##  Pearson's product-moment correlation
## 
## data:  density and quality_lm
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3322718 -0.2815385
## sample estimates:
##        cor 
## -0.3071233

With this selection of data we are no longer able to detect the strong correlation. This was visible from the beginning if I examine the pairs.panels correlation plot again: I now see clearly that between the parameters and the output (quality or rate) there is no correlation above 0.5.

What we observe is correlation among the factors themselves, but not with the evaluated quality.

Even so, I want to create 2D visualizations to see where excellent and bad wines cluster in terms of the 3 parameters. I will create a visualization for each combination of the 3 characteristics.

2D Density plots

The density plot shows that the combination that best separates the wine quality groups is alcohol and density.

2D Density plots histograms

The graphics show no separation between the different buckets of quality for   the 3 variables. Alcohol can be a good candidate to investigate further   due to a partial separation.

With these patterns I will create linear models to see if I can relate quality to those 3 features.

Linear models
## 
## Calls:
## m1: lm(formula = I(quality_lm ~ residual.sugar), data = rw)
## m2: lm(formula = quality_lm ~ residual.sugar + alcohol, data = rw)
## m3: lm(formula = quality_lm ~ residual.sugar + alcohol + density, 
##     data = rw)
## 
## ====================================================
##                      m1         m2          m3      
## ----------------------------------------------------
##   (Intercept)      5.987***   2.021***   90.313***  
##                   (0.020)    (0.117)    (12.374)    
##   residual.sugar  -0.017***   0.022***    0.053***  
##                   (0.002)    (0.002)     (0.005)    
##   alcohol                     0.354***    0.246***  
##                              (0.010)     (0.018)    
##   density                               -87.886***  
##                                         (12.317)    
## ----------------------------------------------------
##   R-squared            0.0        0.2        0.2    
##   adj. R-squared       0.0        0.2        0.2    
##   sigma                0.9        0.8        0.8    
##   F                   47.1      619.4      434.1    
##   p                    0.0        0.0        0.0    
##   Log-likelihood   -6331.2    -5802.2    -5776.8    
##   Deviance          3804.4     3065.3     3033.7    
##   AIC              12668.4    11612.3    11563.6    
##   BIC              12687.9    11638.3    11596.1    
##   N                 4898       4898       4898      
## ====================================================

First of all, we look at the goodness of fit of our models, and it is not good, with R-squared values between 0 and 0.2.

If we compare the models, the best one (within this low correlation) is m3, with all 3 factors combined.
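The stepwise comparison of nested models can be sketched in base R as below. Everything in it is an assumption made for illustration: the data frame `rw` is synthetic, the column names are shortened, and the generating formula for `quality` was chosen only to give a weak fit similar in spirit to the table above.

```r
set.seed(6)
n <- 1000
# Synthetic predictors (assumption), correlated with each other via density.
sugar   <- runif(n, 1, 20)
alcohol <- runif(n, 8, 14)
density <- 0.995 + 0.0004 * sugar - 0.001 * alcohol + rnorm(n, sd = 0.0005)
# Assumed relationship: quality depends weakly on alcohol, plus a lot of noise.
quality <- 2.5 + 0.3 * alcohol + rnorm(n, sd = 0.8)
rw <- data.frame(quality, sugar, alcohol, density)

# Nested models: each one adds a predictor to the previous one.
m1 <- lm(quality ~ sugar, data = rw)
m2 <- update(m1, . ~ . + alcohol)
m3 <- update(m2, . ~ . + density)

# R-squared cannot decrease as predictors are added; AIC penalises extra terms.
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
round(r2, 3)
AIC(m1, m2, m3)
```

Since R-squared never decreases when a predictor is added, a criterion such as AIC or BIC is the fairer way to decide whether the extra term in m3 is worth keeping.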

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

I started my multivariate study with the 3 variables that were correlated with each other. The correlation among them was strong, but when looking closer with the 2D density plots we cannot observe a real separation between the 2 populations (bad and excellent wines). This is confirmed by the analysis made with 3 models that include the features one at a time, according to their correlation (Pearson method) with quality: residual.sugar, alcohol, density.

When I do data analysis for problem solving I use this technique: compare the best of the best records against the worst of the worst records. Sadly, the investigation confirms that with this technique there is no strong correlation between these 3 variables and the wine quality.

Were there any interesting or surprising interactions between features?

The scatterplots do not show an interaction, and the low R-squared values of the linear models confirm this, even though the individual coefficients are statistically significant.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I have created 3 linear models, beginning with just 1 feature as predictor and adding one more predictor in each following model according to its correlation with quality: the first model includes only residual.sugar, then I add alcohol, and then density.

The R-squared value is low (0.2), so it will be difficult to fit the data to a model. The issue is that the data contains quality evaluations mainly for average-quality wines, which makes it difficult to reach a finding about the high-quality wines.


Final Plots and Summary

I have selected the following 3 plots

Plot One

This plot gives us an overview of the dataset. As we will discuss the quality of the white wine further on, we need to identify how large our dataset is and what range of ratings it covers.

This bar chart is the simplest graphic used, but with its colour palette it lets us easily see that considerably more average wines were evaluated than bad or excellent ones.

Plot Two

For the second plot I have selected the corrplot, which gives us the correlation between variables. The colours make it easy to identify positive (blue) and negative (red) correlations. The size of the circles reflects the magnitude of the correlation: the larger the diameter of the circle, the stronger the correlation.

Plot Three

The 2D density plots demonstrate that within the sample there is no separation between bad and excellent wines. This means that the input variables are correlated with each other but unfortunately not with the end result, the output variable quality. We could start an investigation on alcohol, because only a small part of the histograms overlap, and we could assume that high-alcohol white wines have a higher quality if another variable is involved. The candidates to be studied are low density and high alcohol values (they have a negative correlation of -0.78).

Reflection

After taking a look at our white wine dataset, with 4898 records and 14 wine characteristics, I identified 3 that are strongly correlated with each other: residual sugar, alcohol and density.

The correlation between the variables is strong, close to 1   (e.g.: residual sugar and density have a positive correlation of 0.84).

On the other hand, I am surprised that the projects I found on GitHub that treat the white wine data all refer to a correlation between (some of) the white wine variables and the quality, even though the correlation plot shows only a low correlation with the final output.

Taking into account the correlation between those characteristics, I built a linear model that does not predict very well, because the R-squared value is just 0.2, but it allows us to confirm the conclusion.

To obtain more reliable results we should have had a continuous feature for quality, and therefore distinguish better between a wine with an 8.5 and one with a 9. It would be interesting to repeat the calculation on the dataset cleaned of what the boxplots identify as outliers, and to try more complex models instead of linear ones to analyze the relation between the 3 variables and the main output. For further investigation we will need more data for bad and excellent wines in order to build a prediction algorithm (this is one of the main reasons why I did not eliminate outliers).

I struggled to find a correlation between the variables provided and the quality of the wine, but I did not find one, even when creating a linear model. I had issues at the beginning with the installation of the packages, until I understood that it is better to install them in the console and afterwards just call the library. It took me a long time to add all the units of measure and titles, and what I find especially time consuming is splitting every string of text that is longer than 80 characters. What I wish I had known from the start is the .tabset function: it made the HTML easier to view and definitely gave it a better design.

I consider myself successful in learning how to explore data in R and the different possibilities to visualize and summarize the data provided. I discovered a piece of software that is to my liking, more centered on the statistical part of exploring data, and with surprisingly aesthetic visualizations. I love the way it publishes the data easily in HTML, and it also offers free hosting.